Error-free decoding for failure-tolerant memories

ABSTRACT

A translator for a digital memory system which performs single error correction and double error detection (SEC/DED) upon the stored word in converting it into a parity-encoded form and in addition detects circuit failures in the translator itself. The translator also takes a parity-encoded word, checks the parity encoding, translates the word into an SEC/DED form and writes it into memory. The translator consists of a syndrome generator, a single error corrector, a double error detector, a byte parity encoder, a byte parity checker and a circuit to implement a check on the parity-encoded form of the word which is read. The parity-check matrix used in formulating the SEC/DED encoded form of the word has the following properties: Property 1: The columns of the parity check matrix are a minimum Hamming distance of 2 apart. Property 2: Each column of the parity check matrix is of odd weight. Property 3: If there are r check bits C(j), m bytes with parity bits P(i), and odd parity is used, then C(1) ⊕ C(2) ⊕ ... ⊕ C(r) ⊕ P(1) ⊕ ... ⊕ P(m) = (r+m) mod 2.

United States Patent Carter et al.

Patent No. 3,688,265
Date: Aug. 29, 1972
[54] ERROR-FREE DECODING FOR FAILURE-TOLERANT MEMORIES
Inventors: William C. Carter, Donald C. Jessep, Jr., Robert A. Henle, Aspi B. Wadia
Primary Examiner: Charles E. Atkinson
Attorney: Hanifin and Jancin and Victor Siber
7 Claims, 17 Drawing Figures

[FIG. 1: block diagram showing the main store, data word register (D1 ... Dk, C1 ... Cr), syndrome generator with dual output bundle of self-testable pairs, single error correction circuit, RCCO tree, syndrome parity check, coincidence circuit, byte-parity encoder, memory data register (with data in byte-parity form) and XOR tree circuits.]

The invention described herein was made in the performance of work under a NASA contract and is subject to the provisions of Section 305 of the National Aeronautics and Space Act of 1958, Public Law 85-568 (72 Stat. 435; 42 USC 2457).

CROSS-REFERENCE TO RELATED APPLICATIONS

Reference is hereby made to application Ser. No. 747,553, now U.S. Pat. No. 3,559,167, of W. C. Carter, Keith A. Duke, and P. R. Schneider, filed July 25, 1968 and entitled "Self-Checking Error Checker for Two-Rail Coded Data"; to application Ser. No. 99,083 of W. C. Carter and P. R. Schneider, filed Dec. 17, 1970 and entitled "Self-Checking Error Checker for Parity Coded Data"; and to application Ser. No. 747,665, now U.S. Pat. No. 3,559,168, of W. C. Carter, K. A. Duke, and P. R. Schneider, filed July 25, 1968 and entitled "Self-Checking Error Checker for k-Out-of-n Coded Data". These applications may be helpful for a better understanding of the principles and operation of the present application.

BACKGROUND OF THE INVENTION

The present invention relates to a memory translation system that is self-checking. More specifically, it relates to a digital memory translation system providing single error correction and double error detection wherein circuit failures within the translator itself are detected.

In the present state of the data processing art, the relative unreliability of memory systems has caused systems architects and designers to utilize error correction and detection coding for memory words. The increase in the size of memory words, together with the SEC/DED devices required by new memory techniques, has raised the probability of circuit failures in the SEC/DED circuitry itself to a level equal to that of double failures in the memory words.

Present memory systems provide error detection and correction of data errors by various techniques. Probably the most widespread is the use of parity checking, wherein an extra bit or bits accompany the transmitted data bits and are utilized to indicate the correctness of the data of a particular transmission, i.e., normally the parity bit indicates whether an odd or even number of ones appears in the data transmission proper. However, for such parity checking systems, means must be provided for generating the proper parity bits at various transmission points within the computer, and additional means must be provided for checking the parity.

Further advances in the art have resulted in numerous error detection and correction codes. One class of such codes is generally known as single error correction, double error detection codes (SEC/DED). Techniques for constructing such a code may be found in Hamming, R. W., "Error Detecting and Error Correcting Codes," Bell System Technical Journal, 29, 1950, pages 147-160.

A further technique utilized in the prior art for error detection and correction is the development of "syndromes" which indicate whether errors have occurred and which particular bit in the data segment of a particular word needs to be corrected. One such example may be found in U.S. Pat. No. 3,478,313.

All of the above-mentioned prior art codes and systems, while providing error detection and possibly correction of data errors occurring in a memory or storage device, are susceptible to circuit failures and errors generated in the code translator between the memory and the data registers of the processor using the information. Therefore, at present, translator systems are not self-checkable during normal processing. In other words, if the translator were subjected to failure, errors might go undetected, or correct data might be erroneously modified, so that erroneous data would be provided to the machine from its memory in spite of the SEC/DED coding used.

OBJECTS OF THE INVENTION

Therefore, it is an object of the present invention to provide an improved memory translation circuit utilizing a new SEC/DED code decoding and encoding.

It is a further object of the present invention to provide an SEC/DED translator which detects circuit failures in the translator itself.

It is a further object of the present invention to provide a self-testable SEC/DED translator that performs the single error correction and double error detection by means of conventional circuitry.

It is a further object of the present invention to provide a translator which operates on a new SEC/DED code which provides indications as to circuit failures in the translator and double errors in the data words while correcting all single data errors created in the main store.

SUMMARY OF THE INVENTION

In the present invention a self-testing translator structure having a high degree of error control is provided for a digital memory system. Error control is accomplished by utilizing a class of codes known as single-error-correcting, double-error-detecting. The inventive system that utilizes this code provides the following features in a self-testable system:

1. When all data is correct, the self-testing system (a) implements a double data error indication circuitry; (b) implements a final data check indication circuitry; (c) generates syndrome bit pairs, and all 2^r input syndrome patterns are made to appear on the syndrome lines.

2. When all data is correct, circuits and encoding means detect any circuit failure in the translator which causes an erroneous output.

3. When a single error appears in the data obtained from data store, the self-testing circuit means (a) corrects all single data errors when no circuit failures are present; (b) never generates undetected erroneous out- [...]

Property 1: The columns of the parity check matrix are a minimum Hamming distance of 2 apart.

Property 2: Each column of the parity check matrix is of odd weight.

Property 3: Assuming that there are r check bits C(j), m bytes with parity bits P(i), and odd parity is used in the system, then

C(1) ⊕ C(2) ⊕ ... ⊕ C(r) ⊕ P(1) ⊕ ... ⊕ P(m) = (r+m) mod 2

The translator built in accordance with the SEC/DED code having the above properties consists of a syndrome generator, a single error corrector, a double error detector, a byte parity encoder, a byte parity checker and a circuit to implement the check based on Property 3 of the code. In addition to these check circuits, there are two registers: the Data Word Register, which contains data in SEC/DED code, and the Memory Data Register, which contains corrected data in byte parity format.

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention, as illustrated in the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a translator that provides SEC/DED plus detection of circuit failures within the translator itself as the word is read out of memory.

FIG. 2 illustrates the cooperative arrangement of FIGS. 2A-2O.

FIGS. 2A-2O show a detailed circuit diagram of the translator presented in FIG. 1.

THEORY OF THE CODE

A word as used in the preferred embodiment of this invention consists of 64 data bits D(1), D(2), ..., D(64) and eight check bits C(1), C(2), ..., C(8). However, it should be recognized that the code is completely general and is not limited to such a structure.

For the purpose of describing and illustrating the invention presented herein, standard notation utilized in coding theory will be adhered to. In addition, ordinary syndromes are referred to by the notation S(i), where S(i) = s(i0) ⊕ s(i1) in self-testing notation. In the preferred embodiment, n bits shall comprise a memory word, with k data (or message) bits D(i) and r check bits C(i). It is well recognized in the art that basic work on SEC/DED and other codes was presented by R. W. Hamming. In the usual parity check matrix, as presented by Peterson, W. W., Error-Correcting Codes, John Wiley and Sons and MIT Press, New York, N. Y., 1961, the implementation of the row of all 1's, which corresponds to the double error detection process, requires approximately twice as many inputs as the implementation of any other row. It has been found in developing the code of the preferred embodiment that the number of inputs could be reduced by a factor of one-half by specifying each column in the parity check matrix as having an odd number of 1's. In this case, the double-error detection property would be retained, since a double input error would result in an even number of syndrome bits being 0, while a single error would cause an odd number of syndrome bits to be 0. Odd parity is the convention used herein, so that all syndrome bits being identically 1 implies that no error has occurred.
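The parity argument in the preceding paragraph can be checked numerically. The following is a minimal sketch, not part of the disclosed circuitry: the value r = 4 and the names weight, odd_weight_columns and error_pattern_parity are assumptions chosen only for illustration. Each column is held as an integer whose bits are the rows of the parity check matrix, and the columns of the erroneous bit positions are XORed to obtain the set of syndrome bits that a given error pattern disturbs.

```python
from itertools import combinations

def weight(v: int) -> int:
    """Number of 1 bits in the integer v."""
    return bin(v).count("1")

def odd_weight_columns(r: int) -> list[int]:
    """All r-bit column vectors (as integers) with an odd number of 1s."""
    return [v for v in range(1, 1 << r) if weight(v) % 2 == 1]

def error_pattern_parity(columns: list[int], flipped: tuple[int, ...]) -> int:
    """XOR the columns of the flipped bit positions and return the parity of
    the result: 1 if an odd number of syndrome bits is disturbed, else 0."""
    acc = 0
    for pos in flipped:
        acc ^= columns[pos]
    return weight(acc) % 2

cols = odd_weight_columns(4)  # r = 4 is an illustrative assumption

# Every single error disturbs an odd number of syndrome bits.
single_ok = all(error_pattern_parity(cols, (i,)) == 1 for i in range(len(cols)))

# Every double error disturbs an even, nonzero number of syndrome bits.
double_ok = all(
    error_pattern_parity(cols, pair) == 0 and (cols[pair[0]] ^ cols[pair[1]]) != 0
    for pair in combinations(range(len(cols)), 2)
)

print(single_ok, double_ok)  # True True
```

Because the disturbed syndrome bits are the ones that go to 0 under the odd-parity convention, an odd count of zeros indicates a single error and an even, nonzero count indicates a double error.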

In order to construct a parity check matrix, H, for a new set of codes with r check bits and k data bits, it is necessary to first choose the (r choose 1) combinations of one 1 and r-1 zeros and assign the column with a single 1 in the i-th place to the check bit C(i), where 1 ≤ i ≤ r. If k ≤ (r choose 3), it is then necessary to choose k distinct combinations of three 1's and (r-3) 0's as columns corresponding to the data bits. If (r choose 3) < k, then it is necessary to choose all (r choose 3) combinations of three 1's and (r-3) zeros and [k - (r choose 3)] distinct combinations of five 1's and (r-5) zeros as columns in the H parity check matrix. Then, it is possible to continue with 5, 7, 9, ..., etc. bits in a column equal to 1.
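The construction just described can be sketched in a few lines. This is an illustration under stated assumptions, not the matrix of FIGS. 2A-2O: the function name build_parity_check_columns is hypothetical, and the particular weight-3 and weight-5 columns selected are simply the first ones produced by itertools.combinations, whereas the text leaves the selection open. Bit i of each integer stands for row i of H.

```python
from itertools import combinations

def build_parity_check_columns(r: int, k: int) -> tuple[list[int], list[int]]:
    """Return (check_columns, data_columns) for r check bits and k data bits."""
    # Weight-1 columns: a single 1 in row i is assigned to check bit C(i).
    check_columns = [1 << i for i in range(r)]
    data_columns: list[int] = []
    weight = 3
    # Use all weight-3 columns first, then weight 5, 7, ... until k are chosen.
    while len(data_columns) < k:
        if weight > r:
            raise ValueError("r is too small for k data bits")
        for rows in combinations(range(r), weight):
            if len(data_columns) == k:
                break
            data_columns.append(sum(1 << i for i in rows))
        weight += 2
    return check_columns, data_columns

def popcount(v: int) -> int:
    """Number of 1 bits in v."""
    return bin(v).count("1")

# Word format of the preferred embodiment: 64 data bits, 8 check bits.
check_cols, data_cols = build_parity_check_columns(r=8, k=64)
all_cols = check_cols + data_cols
weights = [popcount(c) for c in data_cols]
min_distance = min(popcount(a ^ b) for a, b in combinations(all_cols, 2))

print(weights.count(3), weights.count(5), min_distance)  # 56 8 2
```

For r = 8 there are 56 columns of weight three, so the remaining 8 data columns have weight five, and the minimum distance of 2 between columns corresponds to Property 1.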

In order to implement the self-checking preferred embodiment of an SEC/DED translator, the new code whose code words have a Hamming distance of 4 must have the following properties.

Property 1: The columns of the parity check matrix are a minimum Hamming distance of 2 apart.

Property 2: Each column of the parity check matrix is of odd weight.

Property 3: For r check bits C(j), and m bytes with parity bits P(i), and odd parity being used,

C(1) ⊕ C(2) ⊕ ... ⊕ C(r) ⊕ P(1) ⊕ ... ⊕ P(m) = (r+m) mod 2

MATHEMATICAL PROOFS OF THE CODE

The following proofs demonstrate all of the self-testing and self-checking capabilities of the circuits utilized in the preferred embodiment.

The sum mod 2 (Exclusive-OR or XOR) of the syndromes is r mod 2. If the syndrome equations are XORed, then from Property 2 of the parity check matrix, each C(j) appears once and each D(i) an odd number of times, so

D(1) ⊕ D(2) ⊕ ... ⊕ D(k) ⊕ C(1) ⊕ ... ⊕ C(r) = r mod 2    (1)

Forming the XOR of all output bytes,

D(1) ⊕ D(2) ⊕ ... ⊕ D(k) ⊕ P(1) ⊕ ... ⊕ P(m) = m mod 2    (2)

Then by forming the XOR of equations (1) and (2), the data bits cancel and Property 3 of the code may be stated as

C(1) ⊕ ... ⊕ C(r) ⊕ P(1) ⊕ ... ⊕ P(m) = (r+m) mod 2

Now, let p = min p(i), where p(i) is the number of 1's in the i-th row of the parity check matrix and the minimum is taken over all values of i. Then, the following algorithm will select subsets of the set Di = [D(i1), D(i2), ...] of data bits that have 1's in the i-th row of the parity check matrix so that [(s(10), s(11)), (s(20), s(21)), ..., (s(r0), s(r1))] will take on all 2^r values, and the syndrome generator will have minimum circuit delay.

Step 1. Choose any [p/2] elements of D1 and form s(10) as an XOR of these elements.

Step 2. Choose [p/2] elements of D2 with one element different from the set chosen in Step 1, and form s(20) as an XOR of these elements.

Step r. Choose [p/2] elements of Dr with one element different from the elements of the union of the sets chosen in Step 1, Step 2, ..., Step (r-1), and form s(r0) as an XOR of these elements.

If the required choice at any stage cannot be made, the process is repeated from Step 1, again choosing in each step [p/2] - 1 elements. If again unsuccessful, the process is repeated choosing [p/2] - 2 elements, and so on. Note that the ultimate choice of a single element in each step will always work. Since the r sets are linearly independent, the s(i0)'s take on all 2^r values and hence so will [(s(10), s(11)), (s(20), s(21)), ..., (s(r0), s(r1))]. Choosing the number of bits involved in the generation of s(i0) as near to [p/2] as possible gives minimum circuit delay.
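The selection procedure can be sketched as a simple greedy pass. The sketch below rests on several assumptions that are not in the text: [p/2] is read as the integer part of p/2, the matrix is the hypothetical one produced by taking weight-3 and then weight-5 columns in the order given by itertools.combinations, and the names data_rows, choose_s0_sets and gf2_rank are illustrative. The greedy pass does not backtrack within a given size, so it is a simplification of the repeated trials described above.

```python
from itertools import combinations

def data_rows(r: int, k: int) -> list[list[int]]:
    """For a hypothetical H, list the data-bit indices with a 1 in each row."""
    columns = []
    weight = 3
    while len(columns) < k and weight <= r:
        for rows in combinations(range(r), weight):
            if len(columns) < k:
                columns.append(rows)
        weight += 2
    return [[d for d, col in enumerate(columns) if i in col] for i in range(r)]

def choose_s0_sets(rows: list[list[int]]) -> list[list[int]]:
    """Pick, per row, a subset whose XOR forms s(i0); each subset contains at
    least one data bit that is absent from all subsets chosen before it."""
    p = min(len(row) for row in rows)
    for size in range(max(1, p // 2), 0, -1):    # [p/2], [p/2]-1, ..., 1
        chosen, used = [], set()
        for row in rows:
            fresh = [d for d in row if d not in used]
            if not fresh:
                break                            # this size failed; try smaller
            reused = [d for d in row if d in used]
            # One fresh element, then prefer already-used bits to save fresh ones.
            subset = [fresh[0]] + (reused + fresh[1:])[: size - 1]
            chosen.append(sorted(subset))
            used.update(subset)
        else:
            return chosen
    raise ValueError("greedy pass found no valid choice")

def gf2_rank(masks: list[int]) -> int:
    """Rank over GF(2) of bit-mask vectors, by Gaussian elimination."""
    pivots: dict[int, int] = {}
    for m in masks:
        while m:
            top = m.bit_length() - 1
            if top not in pivots:
                pivots[top] = m
                break
            m ^= pivots[top]
    return len(pivots)

rows = data_rows(r=8, k=64)
sets = choose_s0_sets(rows)
rank = gf2_rank([sum(1 << d for d in s) for s in sets])
print([len(s) for s in sets], rank)
```

A rank equal to r confirms that the chosen sets are linearly independent, which is the condition used in the text to conclude that the s(i0)'s take on all 2^r values.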

Now referring to FIG. 1, the outputs of the RCCO tree and the syndrome parity check, [(R(0), R(1)), (P(0), P(1))], respectively take on all four values [(0,1), (0,1)], [(0,1), (1,0)], [(1,0), (0,1)] and [(1,0), (1,0)]. This is proven by examining [...] where C(1) and C(2) are Boolean functions independent of s(10) and s(11), and C(1) ⊕ C(2) = 1 during normal operation. First choose values of s(10) which make P(0) = 0 and R(0) = a (a is either 0 or 1). Then, change the value of s(10). Now, P(0) = 0 and R(0) is the complement of a. The argument is then repeated with P(0) = 1.

The following proves the testability of the single-error-corrector circuit lines. The implementation equations in the single-error corrector are:

S(i) = s(i0) ⊕ s(i1), i = 1, 2, ..., r
S̄(i) = NOT [s(i0) ⊕ s(i1)], i = 1, 2, ..., r
D(ic) = D(i) ⊕ ANDi(T), i = 1, 2, ..., k
C(jc) = C(j) ⊕ ANDj(T), j = 1, 2, ..., r

where ANDi is the AND gate into which T feeds, and T is an r-dimensional vector with an odd number of elements from [S̄(1), S̄(2), ..., S̄(r)] and its remaining elements from [S(1), S(2), ..., S(r)], as determined by a column in the parity check matrix H. Also, S(i) is one of the elements if S̄(i) is not. For example, S̄(1)S(2)S(3)S(4) and S̄(1)S̄(2)S̄(3)S(4) would be valid ANDi(T)'s for r = 4.
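The corrector equations can be exercised on a toy code. The sketch below is illustrative only: the 8-bit word with R = 4, the column list DATA_COLS, and the function names encode, syndrome, and_gate and correct are assumptions, not the 72-bit matrix of the preferred embodiment. It follows the odd-parity convention of the text, so an error-free word gives S(j) = 1 for every j, and the AND gate for a column takes the negated syndrome where the column has a 1 and the true syndrome elsewhere.

```python
R = 4                                          # number of check bits (assumed)
DATA_COLS = [0b0111, 0b1011, 0b1101, 0b1110]   # odd-weight data columns of H
CHECK_COLS = [1 << j for j in range(R)]        # weight-1 check-bit columns

def encode(data: list[int]) -> list[int]:
    """Check bits under odd parity: row j XORed with C(j) must equal 1."""
    check = []
    for j in range(R):
        acc = 1
        for d, col in zip(data, DATA_COLS):
            if (col >> j) & 1:
                acc ^= d
        check.append(acc)
    return check

def syndrome(data: list[int], check: list[int]) -> list[int]:
    """S(j) for each row j; all ones when the word is error-free."""
    out = []
    for j in range(R):
        bit = check[j]
        for d, col in zip(data, DATA_COLS):
            if (col >> j) & 1:
                bit ^= d
        out.append(bit)
    return out

def and_gate(s: list[int], col: int) -> int:
    """AND of NOT S(j) where the column has a 1 and S(j) where it has a 0."""
    result = 1
    for j in range(R):
        result &= (1 - s[j]) if (col >> j) & 1 else s[j]
    return result

def correct(data: list[int], check: list[int]) -> tuple[list[int], list[int]]:
    """D(ic) = D(i) XOR ANDi(T) and C(jc) = C(j) XOR ANDj(T)."""
    s = syndrome(data, check)
    data_c = [d ^ and_gate(s, col) for d, col in zip(data, DATA_COLS)]
    check_c = [c ^ and_gate(s, col) for c, col in zip(check, CHECK_COLS)]
    return data_c, check_c

data = [1, 0, 1, 1]
check = encode(data)
corrupted = data.copy()
corrupted[2] ^= 1                              # single error in one data bit
print(correct(corrupted, check) == (data, check))  # True
```

Since the columns are distinct, at most one AND gate is activated by any single-error syndrome, and none is activated when the syndrome is all ones.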

The investigation of testability of the lines S(i), S̄(i), ANDi, i = 1, 2, ..., (k+r), is carried out as follows. At first, testability is examined in code space. Then, testability in single error space is examined, with concentration on those failures which were undetectable in code space. Having obtained the set of single failures which are not detected either in code space or single error space, it is determined if further accumulation of undetected failures is possible. From this examination, it is determined that a failure in the single error corrector circuit is detected if the byte parity bits generated from the corrected data bits and the corrected check bits do not satisfy Property 3.
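The role of Property 3 as a check on the corrector output can be illustrated on a small example. The following is a sketch under assumptions: the toy code with R = 4 check bits, the grouping of four data bits into M = 2 bytes (BYTES), and the names check_bits, byte_parity_bits and property3_holds are all hypothetical. An erroneous correction is modeled simply as one flipped data bit presented to the byte parity encoder while the check bits are left unchanged.

```python
from functools import reduce
from operator import xor

R, M = 4, 2                                    # check bits and bytes (assumed)
DATA_COLS = [0b0111, 0b1011, 0b1101, 0b1110]   # odd-weight data columns of H
BYTES = [(0, 1), (2, 3)]                       # data-bit indices in each byte

def check_bits(data: list[int]) -> list[int]:
    """Odd-parity check bits: XOR of row j's data bits, C(j) and 1 is zero."""
    return [
        reduce(xor, (d for d, col in zip(data, DATA_COLS) if (col >> j) & 1), 1)
        for j in range(R)
    ]

def byte_parity_bits(data: list[int]) -> list[int]:
    """Odd byte parity: each byte together with its P(i) has an odd number of 1s."""
    return [reduce(xor, (data[i] for i in group), 1) for group in BYTES]

def property3_holds(check: list[int], parity: list[int]) -> bool:
    """C(1) xor ... xor C(r) xor P(1) xor ... xor P(m) == (r + m) mod 2."""
    return reduce(xor, check + parity) == (R + M) % 2

data = [1, 0, 1, 1]
check = check_bits(data)
good = property3_holds(check, byte_parity_bits(data))

# Model an erroneous "correction": one data bit flipped, check bits untouched.
wrong = data.copy()
wrong[2] ^= 1
bad = property3_holds(check, byte_parity_bits(wrong))

print(good, bad)  # True False
```

A single wrongly flipped data bit changes exactly one byte parity bit, and a single wrongly flipped check bit changes exactly one C(j), so either case alters the XOR on the left side of Property 3.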

Testability in Code Space

In code space S(i), S̄(i) and ANDi never take on the values 0, 1 and 1, respectively, and therefore the failures S(i) stuck-at-1 (s-a-1), S̄(i) stuck-at-0 (s-a-0) and ANDi stuck-at-0 cannot be detected.

S(i) s-a-0: For the error to originate, it is required that S(i) should actually be 1, so S̄(i) = 0, and every correction conjunct ANDj(T) must contain either S(i) or S̄(i). None of the AND gates are activated and thus no erroneous correction results. The failure is undetected, and no error is introduced.

S̄(i) s-a-1: The AND gate corresponding to the correction conjunct S(1)S(2)...S(i-1)S̄(i)S(i+1)...S(r) is erroneously activated, causing a check bit to be erroneously corrected, so the check output E(1) is wrong. Since all data bits are correct, the output E(0) is unchanged, and the pair (E(0), E(1)) signals an error. Here (E(0), E(1)) is a self-testing pair of output lines that indicate a failure in the single error corrector circuit or a single failure in the circuit between the single error corrector and the lines providing the indication E.

ANDi s-a-1: The said AND gate is erroneously activated, so one data bit and hence one byte parity bit is wrong or (exclusively) one check bit is wrong, and in either case the pair (E(0), E(1)) will signal an error.

Hence, the failures undetected in code space are S(i) s-a-0, S(i) s-a-1, S̄(i) s-a-0 and ANDi s-a-0.

Testability in Single Error Space

From Property 1 of the code, over single error space a single error in S(i) or S̄(i) will only change one bit of a column and hence will not change one column of the parity check matrix into another.

Consider the failures undetected in code space.

S(i) s-a-0: For an error to originate, S(i) = 1. S(i) s-a-0 and S̄(i) = 0 will inactivate all AND gates and no correction is made. One bit of the presumably corrected word is in error, and this will be detected by the check of Property 3.

S̄(i) s-a-0: For an error to originate, S̄(i) = 1, so S(i) = 0. The proof is similar to that for S(i) s-a-0 and the error is detected.

ANDi s-a-0: When the single error pattern that should activate this AND gate occurs, no correction will be made. The above proofs show this error will be detected.

S(i) s-a-1: For an error to originate, S(i) = 0, so S̄(i) = 1. Such a pattern will be one that activates an AND gate corresponding to a T that contains S̄(i) and thus not S(i). The error thus does not affect this AND gate and the right correction is made. No erroneous correction is made and the failure is undetected. The output word is correct.

The only single failure that is undetected both in code space and single error space is S(i) s-a-1, so failures in S(i) s-a-1 can accumulate.

Testability of Various Lines of the Single-Error-Corrector Given that One S(j) is s-a-1

For testability in code space, the arguments follow through just as discussed above with the same result, viz. undetected failures are S(i) (i ≠ j) s-a-0, S(i) (i ≠ j) s-a-1, S̄(i) s-a-0 and ANDi s-a-0.

For testability in single error space, the arguments follow through just as for S(i) s-a-0, S̄(i) s-a-0, and ANDi s-a-0.

S(i) (i ≠ j) s-a-1: The single error pattern that can cause errors to originate at both sites of failures must have S(i) = S(j) = 0, so S̄(i) = 1 and S̄(j) = 1. The AND gate that should be activated by this pattern must correspond to a T(1) which contains S̄(i) and S̄(j) and not S(i) or S(j), and in this manner the right correction is

1. A self-testing translator system for translating coded data obtained from a computer main store, and for placing the resulting decoded data in a memory data register for subsequent processing, said system comprising: data word register means connected to said main store for receiving and holding a coded data word consisting of data and check bits; syndrome generator means connected to said data word register for providing a plurality of self-testing output line pairs that represent syndromes of said data words; single error correction means connected to said data word register and to said syndrome generator for correcting all single errors which are present in said data word or in the coded check bits associated with said data word and for outputting correct data and check bits; double error detection means connected to said syndrome generator means for detecting a double error in said coded data word; parity encoding means connected to said single error correction means for allowing detection of errors in said correction means; memory data register means connected to said parity encoder means for storing translated data words; parity check means connected to said memory data register means for checking that the decoded data in said memory data register means is correct.
 2. The system as defined in claim 1 wherein said double error detection means further comprises reduction circuit means for accepting a plurality of self-testable line pairs generated by said syndrome generator; said reduction circuit providing a single set of self-testing output lines which in error space indicate a minimum of one error occurring in said data word; syndrome parity check means for checking the even or odd state of said syndromes for determining a double error in said data word and providing a single set of self-testing output lines which in error space indicate a double error in said data word; coincidence means for reducing the outputs from said reduction circuit means and said syndrome parity check means to a single pair of self-testing output lines which in error space indicate detection of a double error.
 3. The system as defined in claim 2 wherein said single error correction means comprises: means for generating single line syndromes S(i) = s(i0) ⊕ s(i1) and its negation S̄(i) = NOT [s(i0) ⊕ s(i1)]; means for correcting data bit errors by executing the function D(ic) = D(i) ⊕ ANDi(S(i)'s); means for correcting check bit errors by executing the function C(jc) = C(j) ⊕ ANDj(S(i)'s).
 4. The system as defined in claim 3 further comprising final output reduction means connected to said parity check means for reducing the plurality of self-testing output line pairs from said parity check means into a single set of self-testing output lines for indicating that an error occurred between said memory data register means and said final output reduction means.
 5. The system as defined in claim 4 wherein said syndrome generator means comprises a first set of exclusive OR circuits; a second set of exclusive OR circuits; said first exclusive OR circuits forming a plurality of syndromes for subsections of said data word; said second exclusive OR circuits independently forming a plurality of inverted syndromes for subsections of said data word.
 6. The system as defined in claim 4 further comprising: correction circuit check means connected to said single error correction means for determining that an erroneous correction of a data or check bit has occurred.
 7. The system as defined in claim 6 wherein said correction circuit check means further comprises: a first exclusive OR tree circuit connected to the plurality of parity bit output lines of said parity encoding means for reducing said plurality of parity lines to a single output line that forms the first half of a self-testable output line pair; a second exclusive OR tree circuit connected to the plurality of corrected check bit output lines of said single error correction means for reducing said plurality of check bit lines to a single output line that forms the second half of a self-testable output line pair.